Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions
Generative convolutional deep neural networks, e.g. popular GAN
architectures, rely on convolution-based up-sampling methods to produce
non-scalar outputs like images or video sequences. In this paper, we show that
common up-sampling methods, known as up-convolution or transposed
convolution, prevent such models from reproducing the spectral
distributions of natural training data correctly. This effect is independent of
the underlying architecture, and we show that it can be used to easily detect
generated data such as deepfakes with up to 100% accuracy on public benchmarks.
To overcome this drawback of current generative models, we propose to add a
novel spectral regularization term to the training optimization objective. We
show that this approach not only allows training spectrally consistent GANs
that avoid high-frequency errors, but also that a correct approximation
of the frequency spectrum has positive effects on the training stability and
output quality of generative networks.
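The spectral cue discussed above can be made concrete with a small numpy sketch: azimuthally average the 2D power spectrum of an image into a 1D radial profile, then penalize the gap between the average profiles of real and generated batches. Function names and the exact form of the regularizer are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def radial_power_spectrum(img):
    """Azimuthally averaged power spectrum of a grayscale image.

    This 1D radial profile is the spectral signature compared
    between real and generated images (illustrative sketch)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)
    # Mean power within each integer-radius bin.
    profile = np.bincount(r.ravel(), weights=power.ravel()) / np.bincount(r.ravel())
    return profile[: min(cy, cx)]

def spectral_loss(real_imgs, fake_imgs):
    """Hypothetical regularization term: mean squared gap between
    the average log radial spectra of a real and a generated batch."""
    real = np.mean([np.log1p(radial_power_spectrum(i)) for i in real_imgs], axis=0)
    fake = np.mean([np.log1p(radial_power_spectrum(i)) for i in fake_imgs], axis=0)
    return float(np.mean((real - fake) ** 2))
```

In a GAN training loop this scalar would simply be added, with some weight, to the generator objective.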
Unsupervised Multiple Person Tracking using AutoEncoder-Based Lifted Multicuts
Multiple Object Tracking (MOT) is a long-standing task in computer vision.
Current approaches based on the tracking-by-detection paradigm either require
some sort of domain knowledge or supervision to associate data correctly into
tracks. In this work, we present an unsupervised multiple object tracking
approach based on visual features and minimum cost lifted multicuts. Our method
is based on straightforward spatio-temporal cues that can be extracted from
neighboring frames in an image sequence without supervision. Clustering based
on these cues enables us to learn the appearance invariances required for the
tracking task at hand and to train an autoencoder to generate suitable latent
representations. The resulting latent representations can thus serve as robust
appearance cues for tracking, even over large temporal distances where no
reliable spatio-temporal features can be extracted. We show that, despite
being trained without the provided annotations, our model achieves
competitive results on the challenging MOT benchmark for pedestrian tracking.
Spectral Distribution Aware Image Generation
Recent advances in deep generative models for photo-realistic images have led
to high-quality visual results. Such models learn to generate data from a given
training distribution such that generated images cannot be easily
distinguished from real images by the human eye. Yet, recent work on the
detection of such fake images pointed out that they are actually easily
distinguishable by artifacts in their frequency spectra. In this paper, we
propose to generate images according to the frequency distribution of the real
data by employing a spectral discriminator. The proposed discriminator is
lightweight, modular, and works stably with different commonly used GAN losses.
We show that the resulting models can better generate images with realistic
frequency spectra, which are thus harder to detect by this cue.
Comment: Accepted at AAAI 2021 (conference version). Code:
https://github.com/steffen-jung/SpectralGA
Is RobustBench/AutoAttack a suitable Benchmark for Adversarial Robustness?
Recently, RobustBench (Croce et al. 2020) has become a widely recognized
benchmark for the adversarial robustness of image classification networks. In
its most commonly reported sub-task, RobustBench evaluates and ranks the
adversarial robustness of trained neural networks on CIFAR10 under AutoAttack
(Croce and Hein 2020b) with l-inf perturbations limited to eps = 8/255. With
leading scores of the currently best-performing models at around 60% of the
baseline, it is fair to characterize this benchmark as quite challenging.
Despite its general acceptance in recent literature, we aim to foster
discussion about the suitability of RobustBench as a key indicator for
robustness that could generalize to practical applications. Our line of
argumentation against this is two-fold and supported by extensive experiments
presented in this paper: We argue that I) the alteration of data by AutoAttack
with l-inf, eps = 8/255 is unrealistically strong, resulting in close-to-perfect
detection rates of adversarial samples even by simple detection
algorithms and human observers; we also show that other attack methods are much
harder to detect while achieving similar success rates. II) Results on
low-resolution data sets like CIFAR10 do not generalize well to
higher-resolution images, as gradient-based attacks appear to become even more
detectable with increasing resolution.
Comment: AAAI-22 AdvML Workshop Short Paper
FrequencyLowCut Pooling -- Plug & Play against Catastrophic Overfitting
Over the last years, Convolutional Neural Networks (CNNs) have been the
dominant neural architecture in a wide range of computer vision tasks. From
an image and signal processing point of view, this success may be somewhat
surprising, as the inherent spatial pyramid design of most CNNs apparently
violates basic signal processing laws, i.e. the sampling theorem, in their
down-sampling operations. However, since poor sampling appeared not to affect
model accuracy, this issue was broadly neglected until model robustness
started to receive more attention. Recent work [17] in the context of
adversarial attacks and distribution shifts showed that there is a
strong correlation between the vulnerability of CNNs and aliasing artifacts
induced by poor down-sampling operations. This paper builds on these findings
and introduces an aliasing-free down-sampling operation that can easily be
plugged into any CNN architecture: FrequencyLowCut pooling. Our experiments
show that, in combination with simple and fast FGSM adversarial training, our
hyper-parameter-free operator significantly improves model robustness and
avoids catastrophic overfitting.
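One way to read the aliasing-free down-sampling idea is as a hard low-pass in the Fourier domain: discard every frequency above the new Nyquist limit before reducing resolution, so no high-frequency content can fold back as aliasing. The numpy sketch below follows that assumed reading; it is not the authors' implementation, which operates inside a trainable CNN.

```python
import numpy as np

def frequency_low_cut_pool(x, factor=2):
    """Aliasing-free 2x down-sampling sketch: keep only the central
    (low-frequency) block of the shifted spectrum, then invert the
    FFT at the smaller size. Assumed reading of the idea, not the
    paper's exact operator."""
    h, w = x.shape
    nh, nw = h // factor, w // factor
    f = np.fft.fftshift(np.fft.fft2(x))
    # Crop to the nh x nw block of frequencies below the new Nyquist limit.
    top, left = (h - nh) // 2, (w - nw) // 2
    f_low = f[top : top + nh, left : left + nw]
    # Inverse transform at the smaller size; the scale factor keeps
    # mean intensity unchanged.
    return np.real(np.fft.ifft2(np.fft.ifftshift(f_low))) / factor ** 2
```

A pixel-level checkerboard (a pure Nyquist-frequency signal) is mapped to zero by this operator, whereas naive strided subsampling would alias it into a constant image.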
SpectralDefense: Detecting Adversarial Attacks on CNNs in the Fourier Domain
Despite the success of convolutional neural networks (CNNs) in many computer
vision and image analysis tasks, they remain vulnerable to so-called
adversarial attacks: small, crafted perturbations in the input images can lead
to false predictions. A possible defense is to detect adversarial examples. In
this work, we show how analysis in the Fourier domain of input images and
feature maps can be used to distinguish benign test samples from adversarial
images. We propose two novel detection methods: Our first method employs the
magnitude spectrum of the input images to detect an adversarial attack. This
simple and robust classifier can successfully detect adversarial perturbations
of three commonly used attack methods. The second method builds upon the first
and additionally extracts the phase of Fourier coefficients of feature-maps at
different layers of the network. With this extension, we are able to improve
adversarial detection rates compared to state-of-the-art detectors on five
different attack methods.
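The two feature extractors described above can be sketched in a few lines of numpy; the function names and feature layout are illustrative assumptions, and the downstream classifiers trained on these vectors are omitted.

```python
import numpy as np

def fourier_magnitude_features(img):
    """First method, sketched: the log-magnitude spectrum of the
    input image, flattened into a feature vector for a simple
    downstream classifier."""
    f = np.fft.fftshift(np.fft.fft2(img))
    return np.log1p(np.abs(f)).ravel()

def fourier_phase_features(feature_map):
    """Second-method extension, sketched: the phase of the Fourier
    coefficients of an intermediate feature map; in the paper this
    is collected at several layers of the network."""
    return np.angle(np.fft.fft2(feature_map)).ravel()
```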
Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches
In this work, we evaluate two different image clustering objectives, k-means
clustering and correlation clustering, in the context of Triplet Loss induced
feature space embeddings. Specifically, we train a convolutional neural network
to learn discriminative features by optimizing two popular versions of the
Triplet Loss in order to study their clustering properties under the assumption
of noisy labels. Additionally, we propose a new, simple Triplet Loss
formulation, which shows desirable properties with respect to formal clustering
objectives and outperforms the existing methods. We evaluate all three Triplet
Loss formulations for k-means and correlation clustering on the CIFAR-10 image
classification dataset.
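For reference, the standard margin-based Triplet Loss is one of the "popular versions" such a study compares; a minimal numpy sketch follows (the paper's newly proposed formulation is not reproduced here).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard margin-based Triplet Loss:
        L = max(0, ||a - p||^2 - ||a - n||^2 + margin)
    It pushes the anchor-negative distance to exceed the
    anchor-positive distance by at least the margin."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

When the negative already lies far enough from the anchor, the loss is zero and the triplet contributes no gradient; degenerate triplets where positive and negative coincide incur exactly the margin.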
Learning distributional token representations from visual features
In this study, we compare token representations constructed from visual
features (i.e., pixels) with standard lookup-based embeddings. Our goal is to
gain insight into the challenges of encoding a text representation from
low-level features, e.g. from characters or pixels. We focus on Chinese,
which, as a logographic language, has properties that make a representation
via visual features challenging and interesting. To train and evaluate
different models for the token representation, we chose the task of
character-based neural machine translation (NMT) from Chinese to English. We
found that a token representation computed only from visual features can
achieve results competitive with lookup embeddings. However, we also show
different strengths and weaknesses in the models' performance on a
part-of-speech tagging task and on a semantic similarity task. In summary, we
show that it is possible to achieve a text representation from pixels only. We
hope that this is a useful stepping stone for future studies that exclusively
rely on visual input or aim at exploiting visual features of written language.